Gaussian Process Models

Lecture 14

Dr. Colin Rundel

Multivariate Normal

Multivariate Normal Distribution

An \(n\)-dimensional multivariate normal distribution with mean \(\boldsymbol{\mu}\) and covariance \(\boldsymbol{\Sigma}\) (positive semidefinite) can be written as

\[ \underset{n \times 1}{\boldsymbol{y}} \sim N(\underset{n \times 1}{\boldsymbol{\mu}}, \; \underset{n \times n}{\boldsymbol{\Sigma}}) \]

\[ \begin{pmatrix} y_1\\ \vdots\\ y_n \end{pmatrix} \sim N\left( \begin{pmatrix} \mu_1\\ \vdots\\ \mu_n \end{pmatrix}, \, \begin{pmatrix} \rho_{11}\sigma_1\sigma_1 & \cdots & \rho_{1n}\sigma_1\sigma_n \\ \vdots & \ddots & \vdots \\ \rho_{n1}\sigma_n\sigma_1 & \cdots & \rho_{nn}\sigma_n\sigma_n \\ \end{pmatrix} \right) \]

Density

For the \(n\)-dimensional multivariate normal given on the last slide, its density is given by

\[ (2\pi)^{-n/2} \; \det(\boldsymbol{\Sigma})^{-1/2} \; \exp{\left(-\frac{1}{2} \underset{1 \times n}{(\boldsymbol{y}-\boldsymbol{\mu})'} \underset{n \times n}{\boldsymbol{\Sigma}^{-1}} \underset{n \times 1}{(\boldsymbol{y}-\boldsymbol{\mu})}\right)} \]

and its log density is given by

\[ -\frac{n}{2} \log 2\pi - \frac{1}{2} \log \det(\boldsymbol{\Sigma}) - \frac{1}{2} \underset{1 \times n}{(\boldsymbol{y}-\boldsymbol{\mu})'} \underset{n \times n}{\boldsymbol{\Sigma}^{-1}} \underset{n \times 1}{(\boldsymbol{y}-\boldsymbol{\mu})} \]
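The log density can be computed directly in base R, term by term. A minimal sketch (the function name `dmvnorm_log` is my own; in practice one would typically use something like `mvtnorm::dmvnorm`):

```r
# Log density of an n-dimensional MVN, following the formula above.
# determinant(..., logarithm = TRUE) computes log det(Sigma) stably.
dmvnorm_log = function(y, mu, Sigma) {
  n = length(y)
  d = y - mu
  as.numeric(
    -n/2 * log(2*pi) -
    1/2 * determinant(Sigma, logarithm = TRUE)$modulus -
    1/2 * t(d) %*% solve(Sigma, d)
  )
}
```

For \(n = 1\) this reduces to `dnorm(y, mu, sqrt(Sigma), log = TRUE)`.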

Sampling

To generate draws from an \(n\)-dimensional multivariate normal with mean \(\underset{n \times 1}{\boldsymbol{\mu}}\) and covariance matrix \(\underset{n \times n}{\boldsymbol{\Sigma}}\),

  • Find a matrix \(\underset{n \times n}{\boldsymbol{A}}\) such that \(\boldsymbol{\Sigma} = \boldsymbol{A}\,\boldsymbol{A}^t\)

    • most often we use \(\boldsymbol{A} = \text{Chol}(\boldsymbol{\Sigma})\) where \(\boldsymbol{A}\) is a lower triangular matrix.
  • Draw \(n\) iid unit normals, \(N(0,1)\), as \(\underset{n \times 1}{\boldsymbol{z}}\)
  • Obtain multivariate normal draws using \[ \underset{n \times 1}{\boldsymbol{y}} = \underset{n \times 1}{\boldsymbol{\mu}} + \underset{n \times n}{\boldsymbol{A}} \, \underset{n \times 1}{\boldsymbol{z}} \]
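The steps above translate directly into base R; a minimal sketch (`rmvnorm_chol` is my own name, not from a package). Note that R's `chol()` returns the *upper* triangular factor, so the lower triangular \(\boldsymbol{A}\) is its transpose:

```r
# Draws from N(mu, Sigma) following the steps above:
# Sigma = A A' with A lower triangular, y = mu + A z, z ~ iid N(0, 1)
rmvnorm_chol = function(n_draws, mu, Sigma) {
  n = length(mu)
  A = t(chol(Sigma))                         # chol() gives U with Sigma = U'U
  z = matrix(rnorm(n * n_draws), nrow = n)   # each column: n iid unit normals
  mu + A %*% z                               # each column is one MVN draw
}
```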

Bivariate Examples

\[ \boldsymbol{\mu} = \begin{pmatrix}0 \\ 0\end{pmatrix} \qquad \boldsymbol{\Sigma} = \begin{pmatrix}1 & \rho \\ \rho & 1 \end{pmatrix}\]

Marginal / conditional distributions

Proposition - For an \(n\)-dimensional multivariate normal with mean \(\boldsymbol{\mu}\) and covariance matrix \(\boldsymbol{\Sigma}\), any marginal or conditional distribution of the \(y\)’s will also be (multivariate) normal.

Univariate marginal distribution: \[ y_i \sim N(\boldsymbol{\mu}_i,\,\boldsymbol{\Sigma}_{ii}) \]

Bivariate marginal distribution: \[ \boldsymbol{y}_{ij} \sim N\left( \begin{pmatrix}\boldsymbol{\mu}_i \\ \boldsymbol{\mu}_j \end{pmatrix},\; \begin{pmatrix} \boldsymbol{\Sigma}_{ii} & \boldsymbol{\Sigma}_{ij} \\ \boldsymbol{\Sigma}_{ji} & \boldsymbol{\Sigma}_{jj} \end{pmatrix} \right) \]

\(k\)-dimensional marginal distribution:

\[ \boldsymbol{y}_{i,\ldots,k} \sim N\left( \begin{pmatrix}\boldsymbol{\mu}_{i} \\ \vdots \\ \boldsymbol{\mu}_{k} \end{pmatrix},\; \begin{pmatrix} \boldsymbol{\Sigma}_{ii} & \cdots & \boldsymbol{\Sigma}_{i k} \\ \vdots & \ddots & \vdots \\ \boldsymbol{\Sigma}_{k i} & \cdots & \boldsymbol{\Sigma}_{k k} \end{pmatrix} \right) \]

Conditional Distributions

If we partition the \(n\)-dimensions into two pieces such that \(\boldsymbol{y} = (\boldsymbol{y}_1,\, \boldsymbol{y}_2)^t\) then

\[ \underset{n \times 1}{\boldsymbol{y}} \sim N\left( \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix},\, \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix} \right) \] \[ \begin{aligned} \underset{k \times 1}{~~\boldsymbol{y}_1~~} &\sim N(\underset{k \times 1}{~~~\boldsymbol{\mu}_1~~~},\, \underset{k \times k}{~~~\boldsymbol{\Sigma}_{11}~~~}) \\ \underset{n-k \times 1}{\boldsymbol{y}_2} &\sim N(\underset{n-k \times 1}{\boldsymbol{\mu}_2},\, \underset{n-k \times n-k}{\boldsymbol{\Sigma}_{22}}) \end{aligned} \]

then the conditional distributions are given by

\[ \begin{aligned} \boldsymbol{y}_1 ~|~ \boldsymbol{y}_2 = \boldsymbol{a} ~&\sim N(\boldsymbol{\mu_1} + \boldsymbol{\Sigma_{12}} \, \boldsymbol{\Sigma_{22}}^{-1} \, (\boldsymbol{a} - \boldsymbol{\mu_2}),~ \boldsymbol{\Sigma_{11}}-\boldsymbol{\Sigma_{12}}\,\boldsymbol{\Sigma_{22}}^{-1} \, \boldsymbol{\Sigma_{21}}) \\ \\ \boldsymbol{y}_2 ~|~ \boldsymbol{y}_1 = \boldsymbol{b} ~&\sim N(\boldsymbol{\mu_2} + \boldsymbol{\Sigma_{21}} \, \boldsymbol{\Sigma_{11}}^{-1} \, (\boldsymbol{b} - \boldsymbol{\mu_1}),~ \boldsymbol{\Sigma_{22}}-\boldsymbol{\Sigma_{21}}\,\boldsymbol{\Sigma_{11}}^{-1} \, \boldsymbol{\Sigma_{12}}) \end{aligned} \]
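The conditional mean and covariance are easy to compute directly; a small sketch in R following the \(\boldsymbol{y}_1 \,|\, \boldsymbol{y}_2\) formula above (`cond_mvnorm` is a hypothetical helper name):

```r
# Mean and covariance of y1 | y2 = a for a partitioned MVN
cond_mvnorm = function(mu1, mu2, S11, S12, S22, a) {
  list(
    mean = mu1 + S12 %*% solve(S22, a - mu2),   # mu1 + S12 S22^-1 (a - mu2)
    cov  = S11 - S12 %*% solve(S22, t(S12))     # S11 - S12 S22^-1 S21
  )
}
```

In the bivariate case with unit variances this recovers the familiar \(E(y_1|y_2=a) = \rho\, a\) and \(Var(y_1|y_2=a) = 1-\rho^2\).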

Gaussian Processes

From Shumway,

A process, \(\boldsymbol{y} = \{y(t) ~:~ t \in T\}\), is said to be a Gaussian process if all possible finite dimensional vectors \(\boldsymbol{y} = (y_{t_1},y_{t_2},...,y_{t_n})^t\), for every collection of time points \(t_1, t_2, \ldots , t_n\), and every positive integer \(n\), have a multivariate normal distribution.


So far we have only looked at examples of time series where \(T\) is discrete (and evenly spaced & contiguous); it turns out things get a lot more interesting when we explore the case where \(T\) is defined on a continuous space (e.g. \(\mathbb{R}\) or some subset of \(\mathbb{R}\)).

Gaussian Process Regression

Parameterizing a Gaussian Process

Imagine we have a Gaussian process defined such that \(\boldsymbol{y} = \{y(t) ~:~ t \in [0,1]\}\),

  • We now have an uncountably infinite set of possible \(t\)’s and \(y(t)\)’s.
  • We will only have a (small) finite number of observations \(y(t_1), \ldots, y(t_n)\) with which to say something useful about this infinite dimensional process.
  • The unconstrained covariance matrix for the observed data can have up to \(n(n+1)/2\) unique values\(^*\)
  • Necessary to make some simplifying assumptions:

    • Stationarity

    • Simple(r) parameterization of \(\Sigma\)

Covariance Functions

More on these next week, but for now some common examples

Exponential covariance: \[ \Sigma(y_{t},y_{t'}) = \sigma^2 \exp\big(-|t-t'| \; l\,\big) \]

Squared exponential covariance (Gaussian): \[ \Sigma(y_{t},y_{t'}) = \sigma^2 \exp\big(-(|t-t'| \; l\,)^2\big) \]

Powered exponential covariance \(\left(p \in (0,2]\right)\): \[ \Sigma(y_{t},y_{t'}) = \sigma^2 \exp\big(-(|t-t'| \; l\,)^p\big) \]
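These can be written as a single family in R, since the powered exponential contains the other two as special cases (function names here are my own):

```r
# Powered exponential covariance as a function of distance d = |t - t'|;
# p = 1 gives the exponential, p = 2 the squared exponential (Gaussian)
pow_exp_cov = function(d, sigma2 = 1, l = 1, p = 2) {
  sigma2 * exp(-(abs(d) * l)^p)
}
exp_cov    = function(d, sigma2 = 1, l = 1) pow_exp_cov(d, sigma2, l, p = 1)
sq_exp_cov = function(d, sigma2 = 1, l = 1) pow_exp_cov(d, sigma2, l, p = 2)
```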

Covariance Function - Correlation Decay

Letting \(\sigma^2 = 1\) and trying different values of the length scale \(l\),

Correlation Decay - AR(1)

Recall that for a stationary AR(1) process: \(\gamma(h) = \sigma^2_w \phi^{|h|}\) and \(\rho(h) = \phi^{|h|}\)

we can draw a somewhat similar picture about the decay of \(\rho\) as a function of distance.
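The analogy is exact for the exponential covariance: its correlation \(\exp(-l\,d)\) is the AR(1) decay \(\phi^{d}\) with \(\phi = e^{-l}\). A quick numerical check (the value \(l = 0.5\) is arbitrary):

```r
# The exponential covariance's correlation exp(-l d) decays geometrically
# in d, exactly like an AR(1)'s rho(h) = phi^|h| with phi = exp(-l)
l   = 0.5
phi = exp(-l)
d   = 0:10
exp_corr = exp(-l * d)   # exponential covariance correlation at distances d
ar1_corr = phi^d         # AR(1) autocorrelation at lags d
```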

Example

Prediction

Our example has 20 observations \(\left(\boldsymbol {y}_{obs} = \left(y(t_1), \ldots, y(t_{20})\right)'\right)\), which we would like to use as the basis for predicting \(y(t)\) at other values of \(t\) (say a regular sequence of values from 0 to 1), \(\left( \boldsymbol {y}_{pred} \right)\).

For now let’s use a squared exponential covariance with \(\sigma^2 = 10\) and \(l=15\); as such the covariance is given by:

\[ \Sigma(y_{t},y_{t'}) = \sigma^2 \exp\big(-(|t-t'| \; l\,)^2\big) \]

Our goal is then to obtain samples from \(\boldsymbol{y}_{pred} | \boldsymbol{y}_{obs}\) which has the following distribution,

\[\boldsymbol{y}_{pred} ~|~ \boldsymbol{y}_{obs} = \boldsymbol{y} ~\sim N(\boldsymbol{\Sigma}_{po} \, \boldsymbol{\Sigma}_{obs}^{-1} \, \boldsymbol{y},~ \boldsymbol{\Sigma}_{pred}-\boldsymbol{\Sigma}_{po}\,\boldsymbol{\Sigma}_{obs}^{-1} \, \boldsymbol{\Sigma}_{op})\]
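Sampling from this conditional distribution only requires building the three covariance blocks and applying the conditional MVN formulas; a sketch in R under a squared exponential covariance (function names are my own, not from a package):

```r
sq_exp_cov = function(d, sigma2, l) sigma2 * exp(-(abs(d) * l)^2)

# Draws from y_pred | y_obs for a zero-mean GP with squared exp covariance
gp_predict = function(t_obs, y_obs, t_pred, sigma2, l, n_draws = 5) {
  S_obs  = sq_exp_cov(outer(t_obs,  t_obs,  "-"), sigma2, l)   # Sigma_obs
  S_pred = sq_exp_cov(outer(t_pred, t_pred, "-"), sigma2, l)   # Sigma_pred
  S_po   = sq_exp_cov(outer(t_pred, t_obs,  "-"), sigma2, l)   # Sigma_po

  mu  = S_po %*% solve(S_obs, y_obs)                 # conditional mean
  Sig = S_pred - S_po %*% solve(S_obs, t(S_po))      # conditional covariance

  # small diagonal jitter keeps chol() happy despite numerical round-off
  A = t(chol(Sig + diag(1e-8, length(t_pred))))
  mu[, 1] + A %*% matrix(rnorm(length(t_pred) * n_draws), ncol = n_draws)
}
```

Without a nugget the conditional variance collapses to zero at the observed \(t\)’s, so every sampled path interpolates the data exactly (up to the jitter).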

Squared exponential covariance

Exponential covariance

Now let’s consider an exponential covariance model instead, where \(\sigma^2=10\) and \(l = \sqrt{15}\),

\[ \Sigma(y_{t},y_{t'}) = \sigma^2 \exp\big(-|t-t'| \; l\,\big) \]

We are still sampling from \(\boldsymbol{y}_{pred} ~|~ \boldsymbol{y}_{obs}\), all that has changed is the values of the covariance matrices.

What “paths” do we get with this covariance? How are they similar and how are they different from the squared exponential covariance?

Exponential Covariance

Powered exponential covariance (\(p=1.5\))

Back to the squared exponential

Changing the lengthscale (\(l\))

Effective range

For the squared exponential covariance \[ \begin{aligned} Cov(d) &= \sigma^2 \exp\left(-(l \cdot d)^2\right) \\ Corr(d) &= \exp\left(-(l \cdot d)^2\right) \end{aligned} \]

we would like to know, for a given value of \(l\), beyond what distance apart must observations be to have a correlation less than \(0.05\)?

\[ \begin{aligned} \exp\left(-(l \cdot d)^2\right) &< 0.05 \\ -(l \cdot d)^2 &< \log 0.05 \\ l \cdot d &> \sqrt{-\log 0.05} \approx \sqrt{3} \\ d &> \sqrt{3} / l \end{aligned} \]
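Checking the algebra numerically in R (the value of \(l\) is arbitrary):

```r
# At d = sqrt(3)/l the squared exponential correlation equals
# exp(-3) ~ 0.0498, just under the 0.05 cutoff
l = 9
d_eff = sqrt(3) / l
corr_at_range = exp(-(l * d_eff)^2)
```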

Changing the scale (\(\sigma^2\))

Fitting w/ BRMS

library(brms)
gp = brm(y ~ gp(t), data=d, cores=4, refresh=0)
summary(gp)
 Family: gaussian 
  Links: mu = identity; sigma = identity 
Formula: y ~ gp(t) 
   Data: d (Number of observations: 20) 
  Draws: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
         total post-warmup draws = 4000

Gaussian Process Terms: 
            Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sdgp(gpt)       1.60      0.63     0.80     3.28 1.02      282      374
lscale(gpt)     0.12      0.03     0.08     0.17 1.01      249      310

Population-Level Effects: 
          Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
Intercept    -1.01      0.79    -2.82     0.45 1.01     1118     1075

Family Specific Parameters: 
      Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
sigma     0.19      0.05     0.11     0.32 1.01      941     1403

Draws were sampled using sampling(NUTS). For each parameter, Bulk_ESS
and Tail_ESS are effective sample size measures, and Rhat is the potential
scale reduction factor on split chains (at convergence, Rhat = 1).

Some notes

gp() serves a similar function to arma() within a brms model - it specifies the structure of the model which is then translated into stan.

  • The covariance function is specified using the cov argument; currently "exp_quad" is the default and the only supported covariance.

  • The covariance parameterization used differs slightly from what was given previously (but is equivalent),

\[ k(x_i, x_j) = \text{\{sdgp\}}^2 \exp(−||x_i − x_j||^2 / (2 \,\text{\{lscale\}}^2)) \]

  • Model fitting scales very poorly in terms of \(n\), since each likelihood evaluation involves an \(O(n^3)\) factorization of the covariance matrix (this is a stan issue and not just a brms issue)
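The two parameterizations can be checked against each other numerically: equating \(\sigma^2 \exp(-(d\,l)^2)\) with the brms form requires \(\text{sdgp} = \sigma\) and \(\text{lscale} = 1/(\sqrt{2}\,l)\). A quick sketch:

```r
# Equate sigma^2 exp(-(d l)^2) with sdgp^2 exp(-d^2 / (2 lscale^2)):
# the exponents match when lscale = 1 / (sqrt(2) * l)
l      = 15
lscale = 1 / (sqrt(2) * l)
d      = seq(0, 1, by = 0.01)
k_ours = exp(-(d * l)^2)
k_brms = exp(-d^2 / (2 * lscale^2))
```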

Trace plots

plot(gp)

PP Checks

pp_check(gp, ndraws = 100)

Model predictions

Forecasting

brms paths

Why does this look different?

Unlike our previous conditional samples, these predictions no longer pass exactly through each observation, so we get a somewhat better behaved curve and much more reasonable intervals.

The model being fit by brms has one additional parameter that our previous example did not have - a nugget covariance (sigma in the brms summary).

So what does the covariance look like?

\[ k(x_i, x_j) = \text{\{sdgp\}}^2 \exp(−||x_i − x_j||^2 / (2 \,\text{\{lscale\}}^2)) + \text{\{sigma\}}^2 \mathbb{1}_{i=j} \]

Squared exponential w/ nugget (\(\sigma^2_w = 0.1\))